Detection of nonverbal vocalizations using Gaussian Mixture Models: looking for fillers and laughter in conversational speech
In this paper, we analyze acoustic profiles of fillers (i.e., filled pauses, FPs) and laughter with the aim of automatically localizing these nonverbal vocalizations in a stream of audio. Among other features, we use voice quality features to capture the distinctive production modes of laughter and spectral similarity measures to capture the stability of the oral tract that is characteristic of FPs. Classification experiments with Gaussian Mixture Models and various sets of features are performed. We find that Mel-Frequency Cepstrum Coefficients perform relatively well in comparison to other features for both FPs and laughter. To address the large variation in the frame-wise decision scores (e.g., log-likelihood ratios) observed in sequences of frames, we apply a median filter to these scores, which yields large performance improvements. Our analyses and results are presented within the framework of this year’s Interspeech Computational Paralinguistics sub-Challenge on Social Signals.
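The median-filtering step described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `median_smooth`, the window width, the zero decision threshold, and the score values are all assumptions chosen for illustration. It only shows how a sliding median suppresses isolated spikes in frame-wise log-likelihood ratios before thresholding.

```python
import numpy as np

def median_smooth(scores, width=5):
    """Sliding-window median over per-frame scores.

    `width` must be a positive odd integer; the sequence is edge-padded
    so the output has the same length as the input.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    scores = np.asarray(scores, dtype=float)
    half = width // 2
    padded = np.pad(scores, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1)

# Hypothetical frame-wise log-likelihood ratios (made-up values):
# frames 2 and 7 are isolated outliers within otherwise stable regions.
llr = np.array([-2.0, -1.8, 3.0, -2.1, -1.9, 2.2, 2.5, -3.0, 2.4, 2.6])

raw_decisions = llr > 0.0            # flips at the two outlier frames
smoothed = median_smooth(llr, width=3)
decisions = smoothed > 0.0           # one contiguous detected segment
```

In this toy sequence the raw decisions flip at the two outlier frames, while the smoothed decisions form a single contiguous segment. The window width that works in practice depends on the frame rate and the typical event duration, neither of which is stated in the abstract.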
Speech dereverberation and speaker separation using microphone arrays in realistic environments
This thesis concentrates on comparing novel and existing dereverberation and speaker separation techniques using multiple corpora, including a new corpus collected using a microphone array. Many corpora currently used for these techniques are recorded using head-mounted microphones in anechoic chambers. This novel corpus contains recordings with noise and reverberation made in office and workshop environments.
Novel algorithms present a different way of approximating the reverberation, producing results that are competitive with existing algorithms.
Seven correlation-based dereverberation algorithms are evaluated on two different corpora. Three of these are novel algorithms (Hs NTF, Cauchy WPE and Cauchy MIMO WPE). Both non-learning and learning algorithms are tested, with the learning algorithms performing better.
For single- and multi-channel speaker separation, unsupervised non-negative matrix factorization (NMF) algorithms are compared using three cost functions combined with sparsity, convolution and direction of arrival. The results show that the choice of cost function is important for improving the separation result. Furthermore, six different supervised deep learning algorithms are applied to single-channel speaker separation. Historic information improves the result. When comparing NMF to deep learning, NMF converges faster to a solution and provides a better result for the corpora used in this thesis.
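To make the role of the NMF cost function concrete, the following is a minimal sketch of unsupervised NMF with the standard multiplicative updates. The abstract does not name the three cost functions used in the thesis, so the two shown here (Euclidean distance and Kullback-Leibler divergence) are assumptions chosen because they are common. The sparsity, convolution and direction-of-arrival extensions are omitted, and the function name `nmf` and the toy input are hypothetical.

```python
import numpy as np

def nmf(V, rank, cost="euclidean", n_iter=200, seed=0, eps=1e-9):
    """Factorize a nonnegative matrix V (freq x frames) as W @ H.

    Uses multiplicative updates for the chosen cost function.
    `eps` guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral basis vectors
    H = rng.random((rank, n_frames)) + eps # per-frame activations
    for _ in range(n_iter):
        if cost == "euclidean":
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        elif cost == "kl":
            H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
            W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        else:
            raise ValueError("cost must be 'euclidean' or 'kl'")
    return W, H

# Toy nonnegative "magnitude spectrogram" of exact rank 2 (made-up data).
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))

W_euc, H_euc = nmf(V, rank=2, cost="euclidean")
W_kl, H_kl = nmf(V, rank=2, cost="kl")
```

Only the update rules differ between the two branches, which is why swapping the cost function is cheap to try. In a separation setting, each source would be reconstructed from its own subset of the columns of `W` and rows of `H`; how components are assigned to speakers is not described in the abstract and is left out here.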